Automated Data Collection with R Blog

htmltab v.0.6.0

Posted by Christian Rubba on July 23, 2015

The next version of the htmltab package has just been released on CRAN and GitHub. The goal behind htmltab is to make the collection of structured information from HTML tables as easy and painless as possible (read about the package here and here). The most recent update fixes many smaller bugs and inconsistencies and brings significant internal optimization of the code, which increases both the robustness of the function and the verbosity of its messages when something goes wrong. A complete list of the changes can be found here.

install.packages("htmltab")
#or
devtools::install_github("crubba/htmltab")

Header information that appears in the body

With v.0.6.0, a new feature has been introduced that allows users to process header information that appears somewhere in the table body. This is not an uncommon design choice, and the question of how such tables can be processed with R has been debated on Stack Overflow. I illustrate the problem with a table from the American National Weather Service. Below is a crop of this table, which should give you the basic idea:

The task is to assemble a table where the model type information that appears in the body (global models, regional models, ...) populates a separate column in the final table. To this end, htmltab() has been extended to accept in its header argument a formula-like expression that signifies the different levels of header information. The basic format of this formula interface is level1 + level2 + level3 + ... , where you can express the position of each element either numerically or with a character string holding an XPath expression that identifies the respective element. So, for the table above, we pass 1 + "//tr/td[@colspan = '7']", which expresses that the first-level header appears in row 1 and the second-level headers appear in cells that have a colspan attribute of 7:

nomads <- htmltab(doc = "http://christianrubba.com/htmltab/ex/nomads.html", which = 1, header = 1 + "//tr/td[@colspan = '7']")
## Warning: The code for the HTML table you provided contains invalid table tags ('//trbody'). The following transformations were applied:
## 
##               //trbody -> //tbody 
## 
## If you specified an XPath that makes a reference to this tag, this may have caused problems with their identification.
## Warning: The code for the HTML table you provided is malformed. Not all
## cells are nested in row tags (<tr>). htmltab tried to normalize the table
## and ensure that all cells are within row tags. If you specified an XPath
## for body or header elements, this may have caused problems with their
## identification.
head(nomads, 22)
##           Header_1                                     Data Set     freq
## 3    Global Models                                         GDAS  6 hours
## 4    Global Models                              GFS 0.25 Degree  6 hours
## 5    Global Models            GFS 0.25 Degree (Secondary Parms)  6 hours
## 6    Global Models                              GFS 0.50 Degree  6 hours
## 7    Global Models                              GFS 1.00 Degree  6 hours
## 8    Global Models                 GFS Ensemble high resolution  6 hours
## 9    Global Models           GFS Ensemble Precip Bias-Corrected    daily
## 10   Global Models  GFS Ensemble high-resolution Bias-Corrected  6 hours
## 11   Global Models  GFS Ensemble NDGD resolution Bias-Corrected  6 hours
## 12   Global Models         NAEFS high resolution Bias-Corrected  6 hours
## 13   Global Models         NAEFS NDGD resolution Bias-Corrected  6 hours
## 14   Global Models                             NGAC 2D Products    daily
## 15   Global Models                             NGAC 3D Products    daily
## 16   Global Models          NGAC Aerosol Optical Depth Products    daily
## 17   Global Models        Climate Forecast System Flux Products  6 hours
## 18   Global Models Climate Forecast System 3D Pressure Products  6 hours
## 20 Regional Models                            AQM Daily Maximum 06Z, 12Z
## 21 Regional Models                     AQM Hourly Surface Ozone 06Z, 12Z
## 22 Regional Models                                 HIRES Alaska    daily
## 23 Regional Models                                  HIRES CONUS 12 hours
## 24 Regional Models                                   HIRES Guam 12 hours
## 25 Regional Models                                 HIRES Hawaii 12 hours
##    grib filter http     gds-alt
## 3  grib filter http OpenDAP-alt
## 4  grib filter http OpenDAP-alt
## 5  grib filter http        <NA>
## 6  grib filter http OpenDAP-alt
## 7  grib filter http OpenDAP-alt
## 8  grib filter http OpenDAP-alt
## 9  grib filter http OpenDAP-alt
## 10 grib filter http OpenDAP-alt
## 11 grib filter http OpenDAP-alt
## 12 grib filter http OpenDAP-alt
## 13 grib filter http OpenDAP-alt
## 14 grib filter http           -
## 15 grib filter http           -
## 16 grib filter http           -
## 17 grib filter http           -
## 18 grib filter http           -
## 20 grib filter http OpenDAP-alt
## 21 grib filter http OpenDAP-alt
## 22 grib filter http OpenDAP-alt
## 23 grib filter http OpenDAP-alt
## 24 grib filter http OpenDAP-alt
## 25 grib filter http OpenDAP-alt

In the data frame that was produced, we see that the model labels that appeared throughout the body now populate a separate column. Such a format is almost always more useful for further exploration of the data frame, whether through summary statistics or visualizations. More generally, there is no limit on the depth of nesting that a table may exhibit. The only requirements for this feature to work are that the header levels are strictly nested and that you specify the exact position of the elements, either through a numeric vector or an XPath expression.
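To make the reshaping itself more concrete, here is a minimal base R sketch of the same idea on a toy data frame. It is not how htmltab works internally; it only illustrates what "moving in-body headers into their own column" means. The data frame raw, its column names, and the rule that a row counts as a header when its freq cell is missing are all assumptions made for this illustration, and the sketch further assumes that the table starts with a header row.

```r
# Toy stand-in for a scraped table in which section labels sit in the body.
# The values are a small subset of the NOMADS listing shown earlier;
# the data frame and its column names are hypothetical.
raw <- data.frame(
  data_set = c("Global Models", "GDAS", "GFS 0.25 Degree",
               "Regional Models", "AQM Daily Maximum", "HIRES Alaska"),
  freq     = c(NA, "6 hours", "6 hours", NA, "06Z, 12Z", "daily"),
  stringsAsFactors = FALSE
)

# Assumption: a row is an in-body header when its data cell is missing.
is_header <- is.na(raw$freq)

# Number the sections, then carry each header label down to its rows.
# This requires the first row to be a header (group_id must start at 1).
group_id   <- cumsum(is_header)
model_type <- raw$data_set[is_header][group_id]

# Attach the label as a new column and drop the header rows themselves.
tidy <- data.frame(model_type = model_type, raw, stringsAsFactors = FALSE)
tidy <- tidy[!is_header, ]
rownames(tidy) <- NULL

# With the label in its own column, grouped summaries become one-liners.
counts <- table(tidy$model_type)
print(tidy)
print(counts)
```

Once the label lives in its own column, a call such as table() or any grouped summary works directly, which is the practical benefit of the format described above.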

For more information, have a look at the package vignette. As always, I am happy to hear about any problems you experience with the package.